Abstract

Kernel Bayes’ rule has been proposed as a nonparametric kernel-based method 
to realize Bayesian inference in reproducing kernel Hilbert spaces. However, we 
demonstrate both theoretically and experimentally that the prediction result by 
kernel Bayes’ rule is in some cases unnatural. We consider that this phenomenon 
is in part due to the fact that the assumptions in kernel Bayes’ rule do not hold in 
general. 
§1 Introduction

Kernel Bayes’ rule has recently emerged as a novel framework for Bayesian inference |[T]-[^. It is generally agreed that, in this framework, we can estimate the kernel mean of the posterior distribution, given kernel mean expressions of the prior and likelihood distributions. Since the distributions are mapped and nonparametrically manipulated in infinite-dimensional feature spaces called reproducing kernel Hilbert spaces (RKHS), it is believed that kernel Bayes’ rule can accurately evaluate the statistical features of high-dimensional data and enable Bayesian inference even when no appropriate parametric model is available. To date, several applications of kernel Bayes’ rule have been reported. However, the basic theory and the algorithm of kernel Bayes’ rule might need to be modified for the following reasons:

1. The posterior in kernel Bayes’ rule is in some cases completely unaffected by the 
prior. 
2. The posterior in kernel Bayes’ rule depends considerably on the choice of the
parameters to regularize covariance operators.

3. It does not hold in general that conditional expectation functions are included in 
the RKHS, which is an essential assumption of kernel Bayes’ rule. 

This paper is organized as follows. We begin in §2 with a brief review of kernel Bayes’ rule. In §3, we theoretically address the three arguments described above. Numerical experiments are performed in §4 to confirm the theoretical results in §3. In §5, we summarize the theoretical and experimental results and present our conclusions. Some of the proofs for §2 and §3 are given in §6.

§2 Kernel Bayes’ rule 

In this section, we briefly review kernel Bayes’ rule following [2]. Let $\mathcal{X}$ and $\mathcal{Y}$ be measurable spaces, $(X, Y)$ be a random variable with an observed distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, $U$ be a random variable with the prior distribution $\Pi$ on $\mathcal{X}$, and $(Z, W)$ be a random variable with the joint distribution $Q$ on $\mathcal{X} \times \mathcal{Y}$. Note that $Q$ is defined by the prior $\Pi$ and the family $\{P_{Y|x} \mid x \in \mathcal{X}\}$, where $P_{Y|x}$ denotes the conditional distribution of $Y$ given $X = x$. For each $y \in \mathcal{Y}$, let $Q_{X|y}$ represent the posterior distribution of $Z$ given $W = y$. The aim of kernel Bayes’ rule is to derive the kernel mean of $Q_{X|y}$.

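Definition 2.1. Let $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$ be measurable positive definite kernels on $\mathcal{X}$ and $\mathcal{Y}$ such that $E[k_{\mathcal{X}}(X, X)] < \infty$ and $E[k_{\mathcal{Y}}(Y, Y)] < \infty$, respectively, where $E[\cdot]$ denotes the expectation operator. Let $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$ be the RKHS defined by $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$, respectively. We consider two bounded linear operators $C_{YX} : \mathcal{H}_{\mathcal{X}} \to \mathcal{H}_{\mathcal{Y}}$ and $C_{XX} : \mathcal{H}_{\mathcal{X}} \to \mathcal{H}_{\mathcal{X}}$ such that

$$\langle g, C_{YX} f \rangle_{\mathcal{H}_{\mathcal{Y}}} = E[f(X) g(Y)] \quad \text{and} \quad \langle f_1, C_{XX} f_2 \rangle_{\mathcal{H}_{\mathcal{X}}} = E[f_1(X) f_2(X)] \qquad (1)$$

for any $f, f_1, f_2 \in \mathcal{H}_{\mathcal{X}}$ and $g \in \mathcal{H}_{\mathcal{Y}}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}_{\mathcal{X}}}$ and $\langle \cdot, \cdot \rangle_{\mathcal{H}_{\mathcal{Y}}}$ denote inner products on $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$, respectively. The integral expressions for $C_{YX}$ and $C_{XX}$ are given by

$$(C_{YX} f)(\cdot) = \int_{\mathcal{X} \times \mathcal{Y}} k_{\mathcal{Y}}(\cdot, y) f(x) \, dP(x, y) \quad \text{and} \quad (C_{XX} f)(\cdot) = \int_{\mathcal{X}} k_{\mathcal{X}}(\cdot, x) f(x) \, dP_X(x),$$

where $P_X$ denotes the marginal distribution of $X$. Let $C_{XY}$ be the bounded linear operator defined by

$$\langle f, C_{XY} g \rangle_{\mathcal{H}_{\mathcal{X}}} = E[f(X) g(Y)]$$

for any $f \in \mathcal{H}_{\mathcal{X}}$ and $g \in \mathcal{H}_{\mathcal{Y}}$. Then $C_{XY}$ is the adjoint of $C_{YX}$.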

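Theorem 2.2. ([2], Theorem 1) If $E[g(Y) \mid X = \cdot] \in \mathcal{H}_{\mathcal{X}}$ for $g \in \mathcal{H}_{\mathcal{Y}}$, then $C_{XX} E[g(Y) \mid X = \cdot] = C_{XY} g$.

Definition 2.3. Let $Q_Y$ denote the marginal distribution of $W$. Assuming that $E[k_{\mathcal{X}}(U, U)] < \infty$ and $E[k_{\mathcal{Y}}(W, W)] < \infty$, we can define the kernel means of $\Pi$ and $Q_Y$ by

$$m_\Pi = E[k_{\mathcal{X}}(\cdot, U)] \quad \text{and} \quad m_{Q_Y} = E[k_{\mathcal{Y}}(\cdot, W)],$$

respectively. Due to the reproducing properties of $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$, the kernel means satisfy $\langle f, m_\Pi \rangle_{\mathcal{H}_{\mathcal{X}}} = E[f(U)]$ and $\langle g, m_{Q_Y} \rangle_{\mathcal{H}_{\mathcal{Y}}} = E[g(W)]$ for any $f \in \mathcal{H}_{\mathcal{X}}$ and $g \in \mathcal{H}_{\mathcal{Y}}$.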

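Theorem 2.4. ([2], Theorem 2) If $C_{XX}$ is injective, $m_\Pi \in \mathrm{Ran}(C_{XX})$, and $E[g(Y) \mid X = \cdot] \in \mathcal{H}_{\mathcal{X}}$ for any $g \in \mathcal{H}_{\mathcal{Y}}$, then

$$m_{Q_Y} = C_{YX} C_{XX}^{-1} m_\Pi, \qquad (2)$$

where $\mathrm{Ran}(C_{XX})$ denotes the range of $C_{XX}$.

Here we have, for any $x \in \mathcal{X}$,

$$E[k_{\mathcal{Y}}(\cdot, Y) \mid X = x] = C_{YX} C_{XX}^{-1} k_{\mathcal{X}}(\cdot, x) \qquad (3)$$

by replacing $m_\Pi$ in Equation (2) with $k_{\mathcal{X}}(\cdot, x)$. It is noted in [2] that the assumption $m_\Pi \in \mathrm{Ran}(C_{XX})$ does not hold in general. In order to remove this assumption, it has been suggested that $(C_{XX} + \epsilon I)^{-1}$ be used instead of $C_{XX}^{-1}$, where $\epsilon$ is a regularization constant and $I$ is the identity operator. Thus, the approximations of Equations (2) and (3) are respectively given by

$$m_{Q_Y}^{\mathrm{reg}} = C_{YX} (C_{XX} + \epsilon I)^{-1} m_\Pi \quad \text{and} \quad E^{\mathrm{reg}}[k_{\mathcal{Y}}(\cdot, Y) \mid X = x] = C_{YX} (C_{XX} + \epsilon I)^{-1} k_{\mathcal{X}}(\cdot, x).$$

Similarly, for any $y \in \mathcal{Y}$, the approximation of $m_{Q_{X|y}}$ is provided by

$$m_{Q_{X|y}}^{\mathrm{reg}} = E^{\mathrm{reg}}[k_{\mathcal{X}}(\cdot, Z) \mid W = y] = C_{ZW} (C_{WW} + \delta I)^{-1} k_{\mathcal{Y}}(\cdot, y), \qquad (4)$$

where $\delta$ is a regularization constant and the linear operators $C_{ZW}$ and $C_{WW}$ will be defined below.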

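Definition 2.5. We consider the kernel means $m_Q = m_{(ZW)}$ and $m_{(WW)}$ such that

$$\langle m_{(ZW)}, g \otimes f \rangle_{\mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{X}}} = E[f(Z) g(W)] \quad \text{and} \quad \langle m_{(WW)}, g_1 \otimes g_2 \rangle_{\mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}} = E[g_1(W) g_2(W)]$$

for any $f \in \mathcal{H}_{\mathcal{X}}$ and $g, g_1, g_2 \in \mathcal{H}_{\mathcal{Y}}$, where $\otimes$ denotes the tensor product. Let $C_{(YX)X} : \mathcal{H}_{\mathcal{X}} \to \mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{X}}$ and $C_{(YY)X} : \mathcal{H}_{\mathcal{X}} \to \mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}$ be bounded linear operators which respectively satisfy

$$\langle g \otimes f, C_{(YX)X} h \rangle_{\mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{X}}} = E[g(Y) f(X) h(X)],$$
$$\langle g_1 \otimes g_2, C_{(YY)X} f \rangle_{\mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}} = E[g_1(Y) g_2(Y) f(X)] \qquad (5)$$

for any $f, h \in \mathcal{H}_{\mathcal{X}}$ and $g, g_1, g_2 \in \mathcal{H}_{\mathcal{Y}}$.

From Theorem 2.4, Fukumizu et al. [2] proposed that $m_{(ZW)}$ and $m_{(WW)}$ can be given by

$$m_{(ZW)} = C_{(YX)X} C_{XX}^{-1} m_\Pi \in \mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{X}} \quad \text{and} \quad m_{(WW)} = C_{(YY)X} C_{XX}^{-1} m_\Pi \in \mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}.$$

In case $m_\Pi$ is not included in $\mathrm{Ran}(C_{XX})$, they suggested that $m_{(ZW)}$ and $m_{(WW)}$ could be approximated by

$$m_{(ZW)}^{\mathrm{reg}} = C_{(YX)X} (C_{XX} + \epsilon I)^{-1} m_\Pi \quad \text{and} \quad m_{(WW)}^{\mathrm{reg}} = C_{(YY)X} (C_{XX} + \epsilon I)^{-1} m_\Pi.$$

Remark 2.6. ([2], page 3760) $m_{(ZW)}$ and $m_{(WW)}$ can respectively be identified with $C_{ZW}$ and $C_{WW}$.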

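Here, we introduce the empirical method for estimating the posterior kernel mean $m_{Q_{X|y}}$ following [2].

Definition 2.7. Suppose we have an independent and identically distributed (i.i.d.) sample $\{(X_i, Y_i)\}_{i=1}^{n}$ from the observed distribution $P$ on $\mathcal{X} \times \mathcal{Y}$ and a sample $\{U_j\}_{j=1}^{l}$ from the prior distribution $\Pi$ on $\mathcal{X}$. The prior kernel mean $m_\Pi$ is estimated by

$$\hat{m}_\Pi = \sum_{j=1}^{l} \gamma_j k_{\mathcal{X}}(\cdot, U_j), \qquad (6)$$

where $\gamma_1, \ldots, \gamma_l$ are weights. Let us put $\hat{\mathbf{m}}_\Pi = (\hat{m}_\Pi(X_1), \ldots, \hat{m}_\Pi(X_n))^T$, $G_X = (k_{\mathcal{X}}(X_i, X_j))_{1 \le i, j \le n}$, and $G_Y = (k_{\mathcal{Y}}(Y_i, Y_j))_{1 \le i, j \le n}$.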

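Proposition 2.8. ([2], Proposition 3, revised) Let $I_n$ denote the identity matrix of size $n$. The estimates of $C_{ZW}$ and $C_{WW}$ are given by

$$\hat{C}_{ZW} = \sum_{i=1}^{n} \hat{\mu}_i \, k_{\mathcal{X}}(\cdot, X_i) \otimes k_{\mathcal{Y}}(\cdot, Y_i) \quad \text{and} \quad \hat{C}_{WW} = \sum_{i=1}^{n} \hat{\mu}_i \, k_{\mathcal{Y}}(\cdot, Y_i) \otimes k_{\mathcal{Y}}(\cdot, Y_i),$$

respectively, where $\hat{\mu} = (\hat{\mu}_1, \ldots, \hat{\mu}_n)^T = (G_X + n \epsilon I_n)^{-1} \hat{\mathbf{m}}_\Pi$.

The proof of this revised proposition is given in §6.1. It is suggested in [2] that Equation (4) can be empirically estimated by

$$\hat{m}_{Q_{X|y}} = \hat{C}_{ZW} \left( \hat{C}_{WW}^2 + \delta I \right)^{-1} \hat{C}_{WW} k_{\mathcal{Y}}(\cdot, y).$$

Theorem 2.9. ([2], Proposition 4) Given an observation $y \in \mathcal{Y}$, $\hat{m}_{Q_{X|y}}$ can be calculated by

$$\hat{m}_{Q_{X|y}} = \mathbf{k}_X^T R_{X|Y} \mathbf{k}_Y(y), \qquad R_{X|Y} = \Lambda G_Y \left( (\Lambda G_Y)^2 + \delta I_n \right)^{-1} \Lambda,$$

where $\Lambda = \mathrm{diag}(\hat{\mu})$ is the diagonal matrix with the elements of $\hat{\mu}$, $\mathbf{k}_X = (k_{\mathcal{X}}(\cdot, X_1), \ldots, k_{\mathcal{X}}(\cdot, X_n))^T$, and $\mathbf{k}_Y = (k_{\mathcal{Y}}(\cdot, Y_1), \ldots, k_{\mathcal{Y}}(\cdot, Y_n))^T$.

If we want to know the posterior expectation of a function $f \in \mathcal{H}_{\mathcal{X}}$ given an observation $y \in \mathcal{Y}$, it is estimated by

$$\langle f, \hat{m}_{Q_{X|y}} \rangle_{\mathcal{H}_{\mathcal{X}}} = \mathbf{f}_X^T R_{X|Y} \mathbf{k}_Y(y),$$

where $\mathbf{f}_X = (f(X_1), \ldots, f(X_n))^T$.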
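The matrix formula of Theorem 2.9 can be sketched numerically as follows. This is only an illustration under stated assumptions: it uses Python 3 with NumPy, an unnormalized Gaussian kernel for both $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$, and hypothetical toy data; the names `gaussian_gram` and `kbr_weights` are ours and do not come from [2].

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gaussian kernel matrix between rows of A and rows of B (unnormalized; an assumption)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kbr_weights(G_X, G_Y, m_pi, k_y, eps, delta):
    """Return R_{X|Y} k_Y(y), the weights of the estimated posterior kernel mean."""
    n = G_X.shape[0]
    # mu = (G_X + n*eps*I_n)^{-1} m_pi
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), m_pi)
    # Lambda G_Y with Lambda = diag(mu): row i of G_Y is scaled by mu_i
    LG = mu[:, None] * G_Y
    # R = Lambda G_Y ((Lambda G_Y)^2 + delta*I_n)^{-1} Lambda
    R = LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), np.diag(mu))
    return R @ k_y

# Hypothetical toy data: n = 6 observed pairs, l = 4 prior points with equal weights.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(6, 1)), rng.normal(size=(6, 1))
U, gamma = rng.normal(size=(4, 1)), np.full(4, 0.25)
y = np.array([[0.3]])

G_X, G_Y = gaussian_gram(X, X, 1.0), gaussian_gram(Y, Y, 1.0)
m_pi = gaussian_gram(X, U, 1.0) @ gamma      # (m_Pi(X_1), ..., m_Pi(X_n))
k_y = gaussian_gram(Y, y, 1.0)[:, 0]         # (k_Y(Y_1, y), ..., k_Y(Y_n, y))
w = kbr_weights(G_X, G_Y, m_pi, k_y, eps=0.1, delta=0.1)

f_X = np.sin(X[:, 0])                        # values f(X_i) of an illustrative function f
posterior_expectation = f_X @ w              # f_X^T R_{X|Y} k_Y(y)
```

Linear systems are solved with `np.linalg.solve` rather than by forming explicit inverses; this is a numerical convenience and does not change the formula.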






§3 Theoretical arguments 


In this section, we theoretically support the three arguments raised in §1. First, we show in §3.1 that the posterior kernel mean $\hat{m}_{Q_{X|y}}$ is completely unaffected by the prior distribution $\Pi$ under the condition that $\Lambda$ and $G_Y$ are non-singular. This implies that, at least in some cases, $\Pi$ does not properly affect $\hat{m}_{Q_{X|y}}$. Second, we mention in §3.2 that the linear operators $C_{XX}$ and $C_{WW}$ are not always surjective, and address the problems associated with the setting of the regularization parameters $\epsilon$ and $\delta$. Third, we demonstrate in §3.3 that conditional expectation functions are not generally contained in the RKHS, which means that Theorems 1, 2, 5, 6, 7, and 8 in [2] do not work in some situations.


§3.1 Relations between the posterior $\hat{m}_{Q_{X|y}}$ and the prior $\Pi$

Let us review Theorem 2.9. Assume that $G_Y$ and $\Lambda$ are non-singular matrices. (This assumption is not so strange, as shown in §6.2.) The matrix $R_{X|Y} = \Lambda G_Y ((\Lambda G_Y)^2 + \delta I_n)^{-1} \Lambda$ tends to $G_Y^{-1}$ as $\delta$ tends to 0. Furthermore, if we set $\delta = 0$ from the beginning, we obtain

$$R_{X|Y} = \Lambda G_Y (\Lambda G_Y)^{-1} (\Lambda G_Y)^{-1} \Lambda = G_Y^{-1} \Lambda^{-1} \Lambda = G_Y^{-1}.$$

This implies that the posterior kernel mean $\hat{m}_{Q_{X|y}} = \mathbf{k}_X^T R_{X|Y} \mathbf{k}_Y(y)$ never depends on the prior distribution $\Pi$ on $\mathcal{X}$, which seems to be a contradiction to the nature of Bayes’ rule. This result is numerically confirmed in §4.2.
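The cancellation of $\Lambda$ can be checked numerically with a small sketch. The five points, the kernel width, and the two weight vectors below are hypothetical choices made only so that $G_Y$ and $\Lambda$ are well conditioned; they are not taken from the experiments in §4.

```python
import numpy as np

# Five hypothetical, well-separated points in R^2 and a Gaussian Gram matrix (sigma = 0.5).
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]])
sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
G_Y = np.exp(-sq / (2.0 * 0.5 ** 2))
n = G_Y.shape[0]

def R(mu, delta):
    """R_{X|Y} = Lambda G_Y ((Lambda G_Y)^2 + delta I_n)^{-1} Lambda with Lambda = diag(mu)."""
    LG = mu[:, None] * G_Y
    return LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), np.diag(mu))

# Two different non-singular Lambda, standing in for two different priors.
mu_a = np.array([0.5, -1.0, 2.0, 0.3, 1.5])
mu_b = np.ones(n)
G_inv = np.linalg.inv(G_Y)

err_a = np.abs(R(mu_a, 0.0) - G_inv).max()               # delta = 0: R equals G_Y^{-1}
err_b = np.abs(R(mu_b, 0.0) - G_inv).max()
err_small = np.abs(R(mu_a, 1e-10) - G_inv).max()         # small delta: R is close to G_Y^{-1}
gap_large = np.abs(R(mu_a, 1.0) - R(mu_b, 1.0)).max()    # large delta: R depends on Lambda
```

In this sketch `err_a`, `err_b`, and `err_small` are at the level of rounding error, whereas `gap_large` is not: the weights $\hat{\mu}$ influence $R_{X|Y}$ only through the regularization term.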


§3.2 The inverse of the operators Cxx and Cww 

As noted by Fukumizu et al. [2], the linear operators $C_{XX}$ and $C_{WW}$ are not surjective in some usual cases, the proof of which is given in §6.3. Therefore, they proposed an alternative way of obtaining a solution $f \in \mathcal{H}_{\mathcal{X}}$ of the equation $C_{XX} f = m_\Pi$, that is, a regularized inversion $f = (C_{XX} + \epsilon I)^{-1} m_\Pi$ as an analog of ridge regression, where $\epsilon$ is a regularization parameter and $I$ is an identity operator. One of the disadvantages of this method is that the solution $f = (C_{XX} + \epsilon I)^{-1} m_\Pi$ depends upon the choice of $\epsilon$. In §4.2, we numerically show that the prediction using kernel Bayes’ rule depends considerably on the regularization parameters $\epsilon$ and $\delta$. Theorems 5, 6, 7, and 8 in [2] seem to support the appropriateness of the regularized inversion. However, these theorems work under the condition that conditional expectation functions are contained in the RKHS, which does not hold in several cases, as proved in §3.3. Furthermore, since we need to assume sufficiently slow decay of the regularization constants $\epsilon$ and $\delta$ in these theorems, it is practically difficult to set appropriate values for $\epsilon$ and $\delta$. A cross-validation procedure seems to be useful for tuning the parameters, and we may obtain good experimental results; however, it seems to lack theoretical background.

Instead of the regularized inversion method, we can compute generalized inverse
matrices of $G_X$ and $\Lambda G_Y$, given a sample $\{(X_i, Y_i)\}_{i=1}^{n}$. Below, we briefly introduce a
generalization of a matrix inverse. For more details, see Q. 

Definition 3.1. Let $A$ be a matrix of size $m \times n$ over the complex number space $\mathbb{C}$. We say that a matrix $A^{-}$ of size $n \times m$ is a generalized inverse matrix of $A$ if $A A^{-} A = A$. We also say that a matrix $A^{\dagger}$ of size $n \times m$ is the Moore-Penrose generalized inverse matrix of $A$ if $A A^{\dagger}$ and $A^{\dagger} A$ are Hermitian, $A A^{\dagger} A = A$, and $A^{\dagger} A A^{\dagger} = A^{\dagger}$.









Remark 3.2. In fact, any matrix $A$ has the Moore-Penrose generalized inverse matrix $A^{\dagger}$. Note that $A^{\dagger}$ is uniquely determined by $A$. If $A$ is square and non-singular, then $A^{-} = A^{\dagger} = A^{-1}$. For a generalized inverse matrix $A^{-}$ of size $n \times m$, $A A^{-} v = v$ for any vector $v \in \mathbb{C}^m$ if $v$ is contained in the image of $A$. In particular, $A^{-} v$ is a vector contained in the preimage of $v$ under $A$.
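The defining conditions of Definition 3.1 and the statements of Remark 3.2 can be checked with NumPy's `np.linalg.pinv`, which computes the Moore-Penrose generalized inverse through a singular value decomposition. The matrices below are hypothetical examples, and the cutoff `rcond=1e-12` is an arbitrary choice made so that numerically zero singular values are discarded.

```python
import numpy as np

# A rank-deficient 3 x 2 matrix (hypothetical example).
A = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 0.0]])
A_dag = np.linalg.pinv(A, rcond=1e-12)           # 2 x 3 Moore-Penrose generalized inverse

cond1 = np.allclose(A @ A_dag @ A, A)            # A A^+ A = A
cond2 = np.allclose(A_dag @ A @ A_dag, A_dag)    # A^+ A A^+ = A^+
cond3 = np.allclose(A @ A_dag, (A @ A_dag).T)    # A A^+ is symmetric (Hermitian in the real case)
cond4 = np.allclose(A_dag @ A, (A_dag @ A).T)    # A^+ A is symmetric

# For v in the image of A, A^+ v lies in the preimage of v under A.
v = A @ np.array([1.0, -3.0])
in_preimage = np.allclose(A @ (A_dag @ v), v)

# For a square non-singular matrix, the generalized inverse is the usual inverse.
B = np.array([[2.0, 1.0], [1.0, 3.0]])
same_as_inverse = np.allclose(np.linalg.pinv(B), np.linalg.inv(B))
```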


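In the calculation of $\hat{m}_{Q_{X|y}} = \mathbf{k}_X^T R_{X|Y} \mathbf{k}_Y(y)$, we numerically compare the case $R_{X|Y} = R'_{X|Y} := (\Lambda' G_Y)^{\dagger} \Lambda'$ with the original case $R_{X|Y} = \Lambda G_Y ((\Lambda G_Y)^2 + \delta I_n)^{-1} \Lambda$ in §4.2, where $\Lambda' = \mathrm{diag}(G_X^{\dagger} \hat{\mathbf{m}}_\Pi)$.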


§3.3 Conditional expectation functions and RKHS 

In this subsection, we show that conditional expectation functions are in some cases not 
contained in the RKHS. 


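Definition 3.3. For $p \in [1, \infty)$, we define the spaces $L^p(\mathbb{R})$, $L^p(\mathbb{R}, \mathbb{C})$, and $L^p(\mathbb{R}^2, \mathbb{R})$ as

$$L^p(\mathbb{R}) := \left\{ f : \mathbb{R} \to \mathbb{R} \;\middle|\; \int_{-\infty}^{\infty} |f(x)|^p \, dx < \infty \right\},$$
$$L^p(\mathbb{R}, \mathbb{C}) := \left\{ f : \mathbb{R} \to \mathbb{C} \;\middle|\; \int_{-\infty}^{\infty} |f(x)|^p \, dx < \infty \right\},$$
$$L^p(\mathbb{R}^2, \mathbb{R}) := \left\{ f : \mathbb{R}^2 \to \mathbb{R} \;\middle|\; \int_{\mathbb{R}^2} |f(x_1, x_2)|^p \, dx_1 dx_2 < \infty \right\}.$$

We also define the $L^p$ norm for $f \in L^p(\mathbb{R})$ or $f \in L^p(\mathbb{R}, \mathbb{C})$ as

$$\|f\|_p := \left( \int_{-\infty}^{\infty} |f(x)|^p \, dx \right)^{1/p},$$

and the $L^p$ norm for $f \in L^p(\mathbb{R}^2, \mathbb{R})$ as

$$\|f\|_p := \left( \int_{\mathbb{R}^2} |f(x_1, x_2)|^p \, dx_1 dx_2 \right)^{1/p}.$$

Definition 3.4. For a function $f \in L^1(\mathbb{R}, \mathbb{C}) \cap L^2(\mathbb{R}, \mathbb{C})$, we define its Fourier transform as

$$\hat{f}(t) := \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(x) \exp(-\sqrt{-1}\, t x) \, dx.$$

We can uniquely extend the Fourier transform to an isometry $\hat{\ } : L^2(\mathbb{R}, \mathbb{C}) \to L^2(\mathbb{R}, \mathbb{C})$. We also define the inverse Fourier transform $\check{\ } : L^2(\mathbb{R}, \mathbb{C}) \to L^2(\mathbb{R}, \mathbb{C})$ as an isometry uniquely determined by

$$\check{f}(t) := \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(x) \exp(\sqrt{-1}\, t x) \, dx$$

for $f \in L^1(\mathbb{R}, \mathbb{C}) \cap L^2(\mathbb{R}, \mathbb{C})$.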
Definition 3.5. Let us define a Gaussian kernel $k_G$ on $\mathbb{R}$ by

$$k_G(x, y) := \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - y)^2}{2\sigma^2} \right).$$


As described in Q, the RKHS of real-valued functions and complex-valued functions 
corresponding to the positive definite kernel kc are given by 


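$$\mathcal{H}_G := \left\{ f \in L^2(\mathbb{R}) \;\middle|\; \int_{-\infty}^{\infty} |\hat{f}(t)|^2 \exp\left( \frac{\sigma^2}{2} t^2 \right) dt < \infty \right\},$$
$$\mathcal{H}_G(\mathbb{R}, \mathbb{C}) := \left\{ f \in L^2(\mathbb{R}, \mathbb{C}) \;\middle|\; \int_{-\infty}^{\infty} |\hat{f}(t)|^2 \exp\left( \frac{\sigma^2}{2} t^2 \right) dt < \infty \right\},$$

respectively, and the inner product of $f, g \in \mathcal{H}_G$ or $f, g \in \mathcal{H}_G(\mathbb{R}, \mathbb{C})$ on the RKHS is calculated by

$$\langle f, g \rangle = \int_{-\infty}^{\infty} \hat{f}(t) \overline{\hat{g}(t)} \exp\left( \frac{\sigma^2}{2} t^2 \right) dt, \qquad (7)$$

where the overline denotes the complex conjugate. Note that $\mathcal{H}_G$ is a real Hilbert subspace contained in the complex Hilbert space $\mathcal{H}_G(\mathbb{R}, \mathbb{C})$.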

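Fukumizu et al. [2] mentioned that the conditional expectation function $E[g(Y) \mid X = \cdot]$ is not always included in $\mathcal{H}_{\mathcal{X}}$. Indeed, if the variables $X$ and $Y$ are independent, then $E[g(Y) \mid X = \cdot]$ becomes a constant function on $\mathcal{X}$, the value of which might be non-zero. In the case that $\mathcal{X} = \mathbb{R}$ and $k_{\mathcal{X}} = k_G$, the constant function with non-zero value is not contained in $\mathcal{H}_{\mathcal{X}} = \mathcal{H}_G$, because it is not even an element of $L^2(\mathbb{R})$.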

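Additionally, in order to prove Theorems 5 and 8 in [2], they made the assumption that $E[k_{\mathcal{Y}}(Y, \tilde{Y}) \mid X = x, \tilde{X} = \tilde{x}] \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}}$ and $E[k_{\mathcal{X}}(Z, \tilde{Z}) \mid W = y, \tilde{W} = \tilde{y}] \in \mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}$, where $(\tilde{X}, \tilde{Y})$ and $(\tilde{Z}, \tilde{W})$ are independent copies of the random variables $(X, Y)$ and $(Z, W)$ on $\mathcal{X} \times \mathcal{Y}$, respectively. We also see that this assumption does not hold in general. Suppose that $X$ and $Y$ are independent and that so are $\tilde{X}$ and $\tilde{Y}$. Then $E[k_{\mathcal{Y}}(Y, \tilde{Y}) \mid X = x, \tilde{X} = \tilde{x}]$ is a constant function of $(x, \tilde{x})$, the value of which might be non-zero. In the case that $\mathcal{X} = \mathbb{R}$ and $k_{\mathcal{X}} = k_G$, the constant function having non-zero value is not contained in $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{X}} = \mathcal{H}_G \otimes \mathcal{H}_G$. Note that $\mathcal{H}_G \otimes \mathcal{H}_G$ is isomorphic to the RKHS corresponding to the kernel $k((x_1, x_2), (\tilde{x}_1, \tilde{x}_2)) = k_G(x_1, \tilde{x}_1) k_G(x_2, \tilde{x}_2)$ on $\mathbb{R}^2$, that is,

$$\mathcal{H}_G \otimes \mathcal{H}_G = \left\{ f \in L^2(\mathbb{R}^2, \mathbb{R}) \;\middle|\; \int_{\mathbb{R}^2} |\hat{f}(t_1, t_2)|^2 \exp\left( \frac{\sigma^2}{2} (t_1^2 + t_2^2) \right) dt_1 dt_2 < \infty \right\},$$

where the Fourier transform of $f : \mathbb{R}^2 \to \mathbb{R}$ is defined by

$$\hat{f}(t_1, t_2) := \underset{n \to \infty}{\mathrm{l.i.m.}} \left( \frac{1}{\sqrt{2\pi}} \right)^2 \int_{x_1^2 + x_2^2 < n} f(x_1, x_2) \exp\left( -\sqrt{-1}\,(t_1 x_1 + t_2 x_2) \right) dx_1 dx_2.$$

Thus, the assumption that conditional expectation functions are included in the RKHS does not hold in general. Since most of the theorems in [2] require this assumption, kernel Bayes’ rule may not work in several cases.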
§4 Numerical experiments


In this section, we perform numerical experiments to illustrate the theoretical results in §3.1 and §3.2. We first introduce probabilistic classifiers in §4.1 based on conventional Bayes’ rule assuming Gaussian distributions (BR), original kernel Bayes’ rule (KBR1), and kernel Bayes’ rule using Moore-Penrose generalized inverse matrices (KBR2). In §4.2, we apply the three classifiers to a binary classification problem with computer-simulated data sets. Numerical experiments are implemented in version 2.7.6 of the Python software (Python Software Foundation, Wolfeboro Falls, NH, USA).

§4.1 Algorithms of the three classifiers, BR, KBR1, and KBR2

Let $(X, Y)$ be a random variable with a distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} = \{C_1, \ldots, C_g\}$ is a family of classes and $\mathcal{Y} = \mathbb{R}^d$. Let $\Pi$ and $Q$ be the prior and the joint distributions on $\mathcal{X}$ and $\mathcal{X} \times \mathcal{Y}$, respectively. Suppose we have an i.i.d. training sample $\{(X_i, Y_i)\}_{i=1}^{n}$ from the distribution $P$. The aim of this subsection is to derive algorithms of the three classifiers, BR, KBR1, and KBR2, which respectively calculate the posterior probability for each class given an observation $y \in \mathcal{Y}$, that is, $Q_{X|y}(C_1), \ldots, Q_{X|y}(C_g)$.

§4.1.1 The algorithm of BR 

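In BR, we estimate the posterior probability of the $j$-th class ($j = 1, \ldots, g$) given a test value $y \in \mathcal{Y}$ by

$$\hat{Q}_{X|y}(C_j) = \frac{\hat{p}_{Y|C_j}(y) \, \Pi(C_j)}{\sum_{k=1}^{g} \hat{p}_{Y|C_k}(y) \, \Pi(C_k)},$$

where $\hat{p}_{Y|C_j}$ is the density function of the $d$-dimensional normal distribution $N(\hat{M}_j, \hat{S}_j)$ defined by

$$\hat{p}_{Y|C_j}(y) = \frac{1}{\sqrt{(2\pi)^d \, |\hat{S}_j|}} \exp\left( -\frac{1}{2} (y - \hat{M}_j)^T \hat{S}_j^{-1} (y - \hat{M}_j) \right).$$

The mean vector $\hat{M}_j \in \mathbb{R}^d$ and the covariance matrix $\hat{S}_j \in \mathbb{R}^{d \times d}$ are calculated from the training data of the class $C_j$.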
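A minimal sketch of BR, assuming Python 3 with NumPy, is given below. The function names are ours, and `np.cov` uses the unbiased covariance estimator; the text does not specify which estimator was used, so this is an assumption.

```python
import numpy as np

def gaussian_density(y, mean, cov):
    """Density of the d-dimensional normal distribution N(mean, cov) at y."""
    d = mean.shape[0]
    diff = y - mean
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))

def fit_class(Y_class):
    """Mean vector and covariance matrix estimated from the training data of one class."""
    return Y_class.mean(axis=0), np.cov(Y_class, rowvar=False)

def br_posterior(y, means, covs, prior):
    """Posterior class probabilities by conventional Bayes' rule with Gaussian class densities."""
    weighted = np.array([gaussian_density(y, m, S) * p
                         for m, S, p in zip(means, covs, prior)])
    return weighted / weighted.sum()

# Two classes with the parameters of §4.2 and a test value equidistant from both means.
means = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
covs = [np.diag([0.1, 0.1]), np.diag([0.1, 0.1])]
post = br_posterior(np.array([0.5, 0.5]), means, covs, prior=[0.3, 0.7])
```

At the test value $(0.5, 0.5)$ the two class densities coincide, so `post` equals the prior $(0.3, 0.7)$; in BR the posterior moves with the prior, as Bayes’ rule requires.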


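§4.1.2 The algorithm of KBR1

Let us define positive definite kernels $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$ as

$$k_{\mathcal{X}}(X, X') := \begin{cases} 1 & (X = X') \\ 0 & (X \neq X') \end{cases} \quad \text{and} \quad k_{\mathcal{Y}}(Y, Y') := \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\|Y - Y'\|^2}{2\sigma^2} \right)$$

for $X, X' \in \mathcal{X}$ and $Y, Y' \in \mathcal{Y}$, and the corresponding RKHS as $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$, respectively. Here we set $\|Y\| = \sqrt{\sum_{i=1}^{d} y_i^2}$ for $Y = (y_1, y_2, \ldots, y_d)^T \in \mathcal{Y} = \mathbb{R}^d$. Then, the prior kernel mean is given by

$$\hat{m}_\Pi(\cdot) := \sum_{j=1}^{g} \Pi(C_j) \, k_{\mathcal{X}}(\cdot, C_j),$$

where $\sum_{j=1}^{g} \Pi(C_j) = 1$. Let us put $G_X = (k_{\mathcal{X}}(X_i, X_j))_{1 \le i, j \le n}$, $G_Y = (k_{\mathcal{Y}}(Y_i, Y_j))_{1 \le i, j \le n}$, $D = (1_{\{C_i\}}(X_j))_{1 \le i \le g, 1 \le j \le n} \in \{0, 1\}^{g \times n}$, $\hat{\mathbf{m}}_\Pi = (\hat{m}_\Pi(X_1), \ldots, \hat{m}_\Pi(X_n))^T$, $\hat{\mu} = (\hat{\mu}_1, \ldots, \hat{\mu}_n)^T = (G_X + n \epsilon I_n)^{-1} \hat{\mathbf{m}}_\Pi$, $\Lambda = \mathrm{diag}(\hat{\mu})$, $\mathbf{k}_X(\cdot) = (k_{\mathcal{X}}(\cdot, X_1), \ldots, k_{\mathcal{X}}(\cdot, X_n))^T$, $\mathbf{k}_Y(\cdot) = (k_{\mathcal{Y}}(\cdot, Y_1), \ldots, k_{\mathcal{Y}}(\cdot, Y_n))^T$, and $R_{X|Y} = \Lambda G_Y ((\Lambda G_Y)^2 + \delta I_n)^{-1} \Lambda$, where $I_n$ is the identity matrix of size $n$ and $\epsilon, \delta \in \mathbb{R}$ are heuristically set regularization parameters. Note that $1_A$ stands for the indicator function of a set $A$ described as

$$1_A(t) := \begin{cases} 1 & (t \in A) \\ 0 & (t \notin A). \end{cases}$$

Following Theorem 2.9, the posterior kernel mean given a test value $y \in \mathcal{Y}$ is estimated by

$$\hat{m}_{Q_{X|y}} = \mathbf{k}_X^T R_{X|Y} \mathbf{k}_Y(y).$$

Here, we estimate the posterior probabilities for classes given a test value $y \in \mathcal{Y}$ by

$$\left( \hat{Q}_{X|y}(C_1), \ldots, \hat{Q}_{X|y}(C_g) \right)^T = D R_{X|Y} \mathbf{k}_Y(y).$$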
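The KBR1 estimate can be sketched as follows, assuming Python 3 with NumPy. Classes are encoded as integers $0, \ldots, g-1$, the function name `kbr1_posterior` is ours, and the normalizing constant $1/(\sqrt{2\pi}\sigma)$ of $k_{\mathcal{Y}}$ is an assumption carried over from Definition 3.5. The toy data and parameter values are hypothetical.

```python
import numpy as np

def kbr1_posterior(labels, Y, prior, y, sigma, eps, delta):
    """Estimate (Q(C_1 | y), ..., Q(C_g | y)) as D R_{X|Y} k_Y(y)."""
    labels = np.asarray(labels)
    prior = np.asarray(prior, dtype=float)
    n, g = labels.shape[0], prior.shape[0]
    # D[i, j] = 1 if X_j = C_i, else 0
    D = (np.arange(g)[:, None] == labels[None, :]).astype(float)
    # k_X(X_i, X_j) = 1 if the two labels coincide, so G_X = D^T D
    G_X = D.T @ D
    c = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)     # normalizing constant (an assumption)
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    G_Y = c * np.exp(-sq / (2.0 * sigma ** 2))
    k_y = c * np.exp(-((Y - y) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    # m_Pi(X_i) = sum_j Pi(C_j) k_X(X_i, C_j) = Pi(class of X_i)
    m_pi = prior[labels]
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), m_pi)
    LG = mu[:, None] * G_Y                       # Lambda G_Y
    R = LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), np.diag(mu))
    return D @ R @ k_y

# Hypothetical toy run: 10 points per class, parameters chosen only for illustration.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 10)
Y = np.vstack([rng.normal([1.0, 0.0], 0.3, size=(10, 2)),
               rng.normal([0.0, 1.0], 0.3, size=(10, 2))])
post = kbr1_posterior(labels, Y, [0.3, 0.7], np.array([0.5, 0.5]),
                      sigma=0.1, eps=1e-3, delta=1e-3)
```

Nothing in the formula forces the returned values to be non-negative or to sum to one; the sketch returns $D R_{X|Y} \mathbf{k}_Y(y)$ as is.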


§4.1.3 The algorithm of KBR2 


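Let $G_X^{\dagger}$ denote the Moore-Penrose generalized inverse matrix of $G_X$. Let us put $\hat{\mu}' = (\hat{\mu}'_1, \ldots, \hat{\mu}'_n)^T = G_X^{\dagger} \hat{\mathbf{m}}_\Pi$, $\Lambda' = \mathrm{diag}(\hat{\mu}')$, and $R'_{X|Y} = (\Lambda' G_Y)^{\dagger} \Lambda'$. Replacing $R_{X|Y}$ in §4.1.2 with $R'_{X|Y}$, the posterior probabilities for classes given a test value $y \in \mathcal{Y}$ are estimated by

$$\left( \hat{Q}_{X|y}(C_1), \ldots, \hat{Q}_{X|y}(C_g) \right)^T = D R'_{X|Y} \mathbf{k}_Y(y).$$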
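A corresponding sketch of KBR2, under the same assumptions as for KBR1 (Python 3 with NumPy, integer class labels, an assumed normalizing constant for $k_{\mathcal{Y}}$), replaces the regularized inverses by `np.linalg.pinv`. The six training points and the cutoff `rcond=1e-10` are hypothetical choices.

```python
import numpy as np

def kbr2_posterior(labels, Y, prior, y, sigma):
    """Estimate class posteriors as D R'_{X|Y} k_Y(y), R' = (Lambda' G_Y)^+ Lambda'."""
    labels = np.asarray(labels)
    prior = np.asarray(prior, dtype=float)
    g = prior.shape[0]
    D = (np.arange(g)[:, None] == labels[None, :]).astype(float)
    G_X = D.T @ D
    c = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)     # normalizing constant (an assumption)
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    G_Y = c * np.exp(-sq / (2.0 * sigma ** 2))
    k_y = c * np.exp(-((Y - y) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    m_pi = prior[labels]
    # G_X is rank-deficient, so numerically zero singular values are cut off explicitly.
    mu = np.linalg.pinv(G_X, rcond=1e-10) @ m_pi
    Lam = np.diag(mu)
    R = np.linalg.pinv(Lam @ G_Y) @ Lam
    return D @ R @ k_y

# Six hypothetical training points, three per class, and one test value.
labels = np.array([0, 0, 0, 1, 1, 1])
Y = np.array([[1.0, 0.0], [1.2, 0.1], [0.8, -0.2],
              [0.0, 1.0], [0.1, 1.3], [-0.2, 0.9]])
y = np.array([0.9, 0.1])
post_a = kbr2_posterior(labels, Y, [0.1, 0.9], y, sigma=0.1)
post_b = kbr2_posterior(labels, Y, [0.9, 0.1], y, sigma=0.1)
```

For these points $\Lambda'$ and $G_Y$ are non-singular, so $R'_{X|Y} = G_Y^{-1}$ and the two priors give the same output, in line with §3.1.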


§4.2 Probabilistic predictions by the three classifiers 


In this subsection, we apply the three classifiers defined in §4.1 to a binary classification problem using computer-simulated data sets, where $\mathcal{X} = \{C_1, C_2\}$ and $\mathcal{Y} = \mathbb{R}^2$. In the first step, we independently generate 100 sets of training samples with each training sample being $\{(X_i, Y_i)\}_{i=1}^{100} \subset \mathcal{X} \times \mathcal{Y}$, where $X_i = C_1$ and $Y_i \sim N(M_1, S_1)$ if $1 \le i \le 50$, $X_i = C_2$ and $Y_i \sim N(M_2, S_2)$ if $51 \le i \le 100$, $M_1 = (1, 0)^T$, $M_2 = (0, 1)^T$, and $S_1 = S_2 = \mathrm{diag}(0.1, 0.1)$. Here, $\{Y_i\}_{i=1}^{50}$ and $\{Y_i\}_{i=51}^{100}$ are sampled i.i.d. from $N(M_1, S_1)$ and $N(M_2, S_2)$, respectively. Individual $Y$-values of one of the training samples are plotted in Figure 1.
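The data generation step can be sketched as follows, assuming Python 3 with NumPy. The seed and the integer encoding of the classes ($C_1 \mapsto 0$, $C_2 \mapsto 1$) are our choices, so the samples will not reproduce those used for the figures.

```python
import numpy as np

def make_training_sample(rng, n_per_class=50):
    """One training sample: labels (0 for C_1, 1 for C_2) and the matching Y-values."""
    M1, M2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    S = np.diag([0.1, 0.1])                      # S_1 = S_2 = diag(0.1, 0.1)
    Y1 = rng.multivariate_normal(M1, S, size=n_per_class)
    Y2 = rng.multivariate_normal(M2, S, size=n_per_class)
    labels = np.repeat([0, 1], n_per_class)
    return labels, np.vstack([Y1, Y2])

rng = np.random.default_rng(0)                   # arbitrary seed
training_sets = [make_training_sample(rng) for _ in range(100)]
labels, Y = training_sets[0]
```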
Figure 1: Individual Y-values of a training sample (axis label: 1st element of Y)


With each of the 100 training samples and a simulated prior probability of $C_1$, or $\Pi(C_1) \in \{0.1, 0.2, \ldots, 0.9\}$, the classifiers defined in §4.1 estimate the posterior probability of $C_1$ given a test value $y \in \{(0.5, 0.5), (0.6, 0.4), (0.7, 0.3)\}$, that is, $\hat{Q}_{X|y}(C_1)$. Figures 2-5 show the mean (plus or minus standard error of the mean, SEM) of the 100 values of $\hat{Q}_{X|y}(C_1)$ calculated by each of the classifiers, BR, KBR1, and KBR2. Here we show the case where $\sigma$ in KBR1 and KBR2 is fixed to 0.1, and the regularization parameters of KBR1 are set to a common value $\epsilon = \delta$ that is a negative power of 10, with one value for each of Figures 2, 3, 4, and 5. In Figures 2-5, BR_th illustrates the theoretical result of BR, where $\hat{M}_1$, $\hat{M}_2$, $\hat{S}_1$, and $\hat{S}_2$ in BR are set to be $M_1$, $M_2$, $S_1$, and $S_2$, respectively.

Consistent with §3.1, $\hat{Q}_{X|y}(C_1)$ calculated by KBR1 is poorly influenced by $\Pi(C_1)$ compared with that by BR when $\epsilon$ and $\delta$ are set to be small (see Figures 2 and 3). In addition, $\hat{Q}_{X|y}(C_1)$ calculated by KBR2 also seems to be uninfluenced by $\Pi(C_1)$. When $\epsilon$ and $\delta$ are set to be larger, the effect of $\Pi(C_1)$ on $\hat{Q}_{X|y}(C_1)$ becomes apparent in KBR1; however, the value of $\hat{Q}_{X|y}(C_1)$ becomes too small (see Figures 4 and 5). These results suggest that in kernel Bayes’ rule, the posterior does not depend on the prior if $\epsilon$ and $\delta$ are negligible, which might be a contradiction to the nature of Bayes’ theorem. Moreover, even though the prior affects the posterior when $\epsilon$ and $\delta$ become larger, the posterior seems too dependent on $\epsilon$ and $\delta$, which are initially defined just for the regularization of matrices.

We have also tested all possible combinations of the following values for the parameters in KBR1 and/or KBR2: $\epsilon \in \{10^{-1}, 10^{-3}, 10^{-5}, 10^{-7}, 10^{-9}, 10^{-11}, 10^{-13}, 10^{-15}\}$, $\delta \in \{10^{-1}, 10^{-3}, 10^{-5}, 10^{-7}, 10^{-9}, 10^{-11}, 10^{-13}, 10^{-15}\}$, and $\sigma \in \{0.01, 0.1, 1, 10, 100\}$. All the experimental results have been evaluated in a similar manner as above, and none of the results are found to be reasonable in the context of Bayesian inference.

Figure 2: The case $\epsilon = \delta$ set to a fixed power of 10


§5 Conclusions 


One of the important features of Bayesian inference is that it provides a reasonable 
way of updating the probability for a hypothesis as additional evidence is acquired. 
Kernel Bayes’ rule has been expected to enable Bayesian inference in RKHS. In other 
words, the posterior kernel mean has been considered to be reasonably estimated by 
kernel Bayes’ rule, given kernel mean expressions of the prior and likelihood. What is
“reasonable” depends on circumstances; however, some of the results in this paper seem
to show obviously unreasonable aspects of kernel Bayes’ rule, at least in the context of
Bayesian inference.


First, as shown in §3.1, when $\Lambda$ and $G_Y$ are non-singular matrices and so we set $\delta = 0$, the posterior kernel mean $\hat{m}_{Q_{X|y}}$ is entirely unaffected by the prior distribution $\Pi$ on $\mathcal{X}$. This means that, in Bayesian inference with kernel Bayes’ rule, prior beliefs are in some cases completely neglected in calculating the kernel mean of the posterior distribution. Numerical evidence is also presented in §4.2. When the regularization parameters $\epsilon$ and $\delta$ are set to be small, the posterior probability calculated by kernel Bayes’ rule (KBR1) is almost unaffected by the prior probability in comparison with that by conventional Bayes’ rule (BR). Consistently, when the regularized inverse matrices in KBR1 are replaced with the Moore-Penrose generalized inverse matrices (KBR2), the posterior probability is also uninfluenced by the prior probability, which seems to be unsuitable in the context of Bayesian updating of a probability distribution.

Figure 3: The case $\epsilon = \delta$ set to a fixed power of 10

Second, as discussed in §3.2 and §4.2, the posterior estimated by kernel Bayes’ rule depends considerably upon the regularization parameters $\epsilon$ and $\delta$, which are originally introduced just for the regularization of matrices. A cross-validation approach is proposed in [2] to search for the optimal values of the parameters. However, theoretical foundations seem to be insufficient for the correct tuning of the parameters. Furthermore, in our experimental settings, we are not able to obtain a reasonable result using any combination of the parameter values, suggesting the possibility that there are no appropriate values for the parameters in general. Thus, we consider it difficult to solve the problem that $C_{XX}$ and $C_{WW}$ are not surjective by just adding regularization parameters.

Figure 4: The case $\epsilon = \delta$ set to a fixed power of 10

Third, as shown in §3.3, the assumption that conditional expectation functions are included in the RKHS does not hold in general. Since this assumption is necessary for most of the theorems in [2], we believe that the assumption itself may need to be reconsidered.

In summary, even though current research efforts are focused on the application of 
kernel Bayes’ rule @0, it might be necessary to reexamine its basic framework of 
combining new evidence with prior beliefs. 


§6 Appendix 

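In this section, we provide some proofs for §2 and §3.

§6.1 Estimation of $C_{ZW}$ and $C_{WW}$

Here we give the proof of Proposition 2.8.

Figure 5: The case $\epsilon = \delta$ set to a fixed power of 10

Proof. Let $\hat{C}_{XX}$, $\hat{C}_{(YX)X}$, and $\hat{C}_{(YY)X}$ denote the estimates of $C_{XX}$, $C_{(YX)X}$, and $C_{(YY)X}$, respectively. We define the estimates of $m_{(ZW)}$ and $m_{(WW)}$ as

$$\hat{m}_{(ZW)} = \hat{C}_{(YX)X} \hat{C}_{XX}^{-1} \hat{m}_\Pi \quad \text{and} \quad \hat{m}_{(WW)} = \hat{C}_{(YY)X} \hat{C}_{XX}^{-1} \hat{m}_\Pi,$$

and put $\hat{h} = \hat{C}_{XX}^{-1} \hat{m}_\Pi \in \mathcal{H}_{\mathcal{X}}$. According to Equations (5), for any $f \in \mathcal{H}_{\mathcal{X}}$ and $g \in \mathcal{H}_{\mathcal{Y}}$,

$$\langle \hat{m}_{(ZW)}, g \otimes f \rangle_{\mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{X}}} = \langle \hat{C}_{(YX)X} \hat{h}, g \otimes f \rangle_{\mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{X}}} = \hat{E}[f(X) g(Y) \hat{h}(X)]$$
$$= \frac{1}{n} \sum_{i=1}^{n} f(X_i) g(Y_i) \hat{h}(X_i) = \left\langle \frac{1}{n} \sum_{i=1}^{n} \hat{h}(X_i) \, k_{\mathcal{X}}(\cdot, X_i) \otimes k_{\mathcal{Y}}(\cdot, Y_i), \; f \otimes g \right\rangle_{\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}},$$

where $\hat{E}[\cdot]$ represents the empirical expectation operator. Thus, from Remark 2.6,

$$\hat{C}_{ZW} = \hat{m}_{(ZW)} = \frac{1}{n} \sum_{i=1}^{n} \hat{h}(X_i) \, k_{\mathcal{X}}(\cdot, X_i) \otimes k_{\mathcal{Y}}(\cdot, Y_i). \qquad (8)$$

Similarly, for any $g_1, g_2 \in \mathcal{H}_{\mathcal{Y}}$,

$$\langle \hat{m}_{(WW)}, g_1 \otimes g_2 \rangle_{\mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}} = \langle \hat{C}_{(YY)X} \hat{h}, g_1 \otimes g_2 \rangle_{\mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}} = \hat{E}[g_1(Y) g_2(Y) \hat{h}(X)]$$
$$= \frac{1}{n} \sum_{i=1}^{n} g_1(Y_i) g_2(Y_i) \hat{h}(X_i) = \left\langle \frac{1}{n} \sum_{i=1}^{n} \hat{h}(X_i) \, k_{\mathcal{Y}}(\cdot, Y_i) \otimes k_{\mathcal{Y}}(\cdot, Y_i), \; g_1 \otimes g_2 \right\rangle_{\mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{Y}}}.$$

Thus, from Remark 2.6,

$$\hat{C}_{WW} = \hat{m}_{(WW)} = \frac{1}{n} \sum_{i=1}^{n} \hat{h}(X_i) \, k_{\mathcal{Y}}(\cdot, Y_i) \otimes k_{\mathcal{Y}}(\cdot, Y_i). \qquad (9)$$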

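Next, we will derive $\hat{h}(X_1), \ldots, \hat{h}(X_n)$. Since $\hat{C}_{XX}$ is a self-adjoint operator,

$$\langle \hat{h}, \hat{C}_{XX} f \rangle_{\mathcal{H}_{\mathcal{X}}} = \langle \hat{C}_{XX} \hat{h}, f \rangle_{\mathcal{H}_{\mathcal{X}}} = \langle \hat{m}_\Pi, f \rangle_{\mathcal{H}_{\mathcal{X}}} = \sum_{j=1}^{l} \gamma_j f(U_j)$$

for any $f \in \mathcal{H}_{\mathcal{X}}$. On the other hand, from Equations (1),

$$\langle \hat{h}, \hat{C}_{XX} f \rangle_{\mathcal{H}_{\mathcal{X}}} = \hat{E}[f(X) \hat{h}(X)] = \frac{1}{n} \sum_{i=1}^{n} f(X_i) \hat{h}(X_i)$$

for any $f \in \mathcal{H}_{\mathcal{X}}$. Hence, we have

$$\sum_{j=1}^{l} \gamma_j f(U_j) = \frac{1}{n} \sum_{i=1}^{n} f(X_i) \hat{h}(X_i) \qquad (10)$$

for any $f \in \mathcal{H}_{\mathcal{X}}$. Replacing $f$ in Equation (10) with $k_{\mathcal{X}}(X_1, \cdot), \ldots, k_{\mathcal{X}}(X_n, \cdot) \in \mathcal{H}_{\mathcal{X}}$, we have

$$\begin{pmatrix} k_{\mathcal{X}}(X_1, U_1) & \cdots & k_{\mathcal{X}}(X_1, U_l) \\ \vdots & & \vdots \\ k_{\mathcal{X}}(X_n, U_1) & \cdots & k_{\mathcal{X}}(X_n, U_l) \end{pmatrix} \begin{pmatrix} \gamma_1 \\ \vdots \\ \gamma_l \end{pmatrix} = \frac{1}{n} G_X \begin{pmatrix} \hat{h}(X_1) \\ \vdots \\ \hat{h}(X_n) \end{pmatrix}. \qquad (11)$$

Using Equation (6), the left hand side of Equation (11) is given by

$$\begin{pmatrix} \sum_{j=1}^{l} \gamma_j k_{\mathcal{X}}(X_1, U_j) \\ \vdots \\ \sum_{j=1}^{l} \gamma_j k_{\mathcal{X}}(X_n, U_j) \end{pmatrix} = \begin{pmatrix} \langle \sum_{j=1}^{l} \gamma_j k_{\mathcal{X}}(\cdot, U_j), k_{\mathcal{X}}(\cdot, X_1) \rangle_{\mathcal{H}_{\mathcal{X}}} \\ \vdots \\ \langle \sum_{j=1}^{l} \gamma_j k_{\mathcal{X}}(\cdot, U_j), k_{\mathcal{X}}(\cdot, X_n) \rangle_{\mathcal{H}_{\mathcal{X}}} \end{pmatrix} = \begin{pmatrix} \hat{m}_\Pi(X_1) \\ \vdots \\ \hat{m}_\Pi(X_n) \end{pmatrix} = \hat{\mathbf{m}}_\Pi.$$

Therefore, with the inversion of $G_X$ regularized as in §2, we have

$$\frac{1}{n} \begin{pmatrix} \hat{h}(X_1) \\ \vdots \\ \hat{h}(X_n) \end{pmatrix} = (G_X + n \epsilon I_n)^{-1} \begin{pmatrix} \hat{m}_\Pi(X_1) \\ \vdots \\ \hat{m}_\Pi(X_n) \end{pmatrix} = (G_X + n \epsilon I_n)^{-1} \hat{\mathbf{m}}_\Pi = \hat{\mu}.$$

Replacing $\frac{1}{n} (\hat{h}(X_1), \ldots, \hat{h}(X_n))^T$ with $\hat{\mu} = (\hat{\mu}_1, \ldots, \hat{\mu}_n)^T$, Equations (8) and (9) become

$$\hat{C}_{ZW} = \sum_{i=1}^{n} \hat{\mu}_i \, k_{\mathcal{X}}(\cdot, X_i) \otimes k_{\mathcal{Y}}(\cdot, Y_i) \quad \text{and} \quad \hat{C}_{WW} = \sum_{i=1}^{n} \hat{\mu}_i \, k_{\mathcal{Y}}(\cdot, Y_i) \otimes k_{\mathcal{Y}}(\cdot, Y_i),$$

respectively.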
§6.2 Non-singularity of $G_Y$ and $\Lambda$

Here we show that the assumption in §3.1 holds under reasonable conditions.

Definition 6.1. Let $f$ be a real-valued function defined on a non-empty open domain $\mathrm{Dom}(f) \subset \mathbb{R}^d$. We say that $f$ is analytic if $f$ can be described by a Taylor expansion on a neighborhood of each point of $\mathrm{Dom}(f)$.

Proposition 6.2. Let $k$ be a positive definite kernel on $\mathbb{R}^d$. Let $\nu$ be a probability measure on $\mathbb{R}^d$ which is absolutely continuous with respect to Lebesgue measure. Assume that $k$ is an analytic function on $\mathbb{R}^d \times \mathbb{R}^d$ and that the RKHS corresponding to $k$ is infinite dimensional. Then for any i.i.d. random variables $X_1, X_2, \ldots, X_n$ with the same distribution $\nu$, the Gram matrix $G_X = (k(X_i, X_j))_{1 \le i, j \le n}$ is non-singular almost surely with respect to $\nu^n = \nu \times \nu \times \cdots \times \nu$ ($n$ times).

Proof. Let us put $f(x_1, x_2, \ldots, x_n) := \det (k(x_i, x_j))_{1 \le i, j \le n}$. Since the RKHS corresponding to $k$ is infinite dimensional, there are $\xi_1, \xi_2, \ldots, \xi_n \in \mathbb{R}^d$ such that $\{k(\cdot, \xi_i)\}_{1 \le i \le n}$ are linearly independent. Then $f(\xi_1, \xi_2, \ldots, \xi_n) \neq 0$ and hence $f$ is a non-zero analytic function. Note that any non-trivial subvarieties of the Euclidean spaces defined by analytic functions have Lebesgue measure zero. By this fact, the subvariety

$$V(f) := \{ (x_1, x_2, \ldots, x_n) \in (\mathbb{R}^d)^n \mid f(x_1, x_2, \ldots, x_n) = 0 \} \subset (\mathbb{R}^d)^n$$

has Lebesgue measure zero. Since $\nu$ is absolutely continuous, $\nu^n(V(f)) = 0$. This completes the proof. □

From Proposition 6.2, we easily obtain the following corollary.

Corollary 6.3. Let $k$ be a Gaussian kernel on $\mathbb{R}^d$ and let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with the same normal distribution on $\mathbb{R}^d$. Then the Gram matrix $G_X = (k(X_i, X_j))_{1 \le i, j \le n}$ is non-singular almost surely.
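Corollary 6.3 can be illustrated, though of course not proved, by a small numerical sketch in Python 3 with NumPy. The sample size, dimension, kernel width, and seed are arbitrary choices; for larger $n$ or wider kernels the Gram matrix remains non-singular in exact arithmetic but can become numerically ill-conditioned.

```python
import numpy as np

rng = np.random.default_rng(1)                   # arbitrary seed
n, d, sigma = 8, 2, 0.5                          # hypothetical sizes and kernel width
X = rng.normal(size=(n, d))                      # i.i.d. sample from a normal distribution

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
G_X = np.exp(-sq / (2.0 * sigma ** 2))           # Gaussian Gram matrix

rank = np.linalg.matrix_rank(G_X)                # numerical rank
sign, logdet = np.linalg.slogdet(G_X)            # sign and log|det| of G_X
```

A full numerical rank and a determinant sign of $+1$ are what the corollary predicts for a positive definite Gram matrix.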

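Proposition 6.4. Let $k$ be a positive definite kernel on $\mathcal{X} = \mathbb{R}^d$, and $\nu$ a probability measure on $\mathcal{X}$ which is absolutely continuous with respect to Lebesgue measure. Assume that $k$ is an analytic function on $\mathcal{X} \times \mathcal{X}$ and that the RKHS $\mathcal{H}$ corresponding to $k$ is infinite dimensional. Then for any $(\epsilon, \gamma_1, \gamma_2, \ldots, \gamma_l, U_1, U_2, \ldots, U_l) \in \mathbb{R}_+ \times \mathbb{R}^l \times (\mathbb{R}^d)^l$ except for a set of Lebesgue measure zero, and for any i.i.d. random variables $X_1, X_2, \ldots, X_n$ with the same distribution $\nu$, each $\hat{\mu}_i$ for $i = 1, 2, \ldots, n$ is non-zero almost surely, where $(\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_n)^T = (G_X + n \epsilon I_n)^{-1} \hat{\mathbf{m}}_\Pi$, $\hat{\mathbf{m}}_\Pi = (\hat{m}_\Pi(X_1), \hat{m}_\Pi(X_2), \ldots, \hat{m}_\Pi(X_n))^T$, and $\hat{m}_\Pi(\cdot) = \sum_{j=1}^{l} \gamma_j k(\cdot, U_j)$.

Here $\mathbb{R}_+$ denotes the set of positive real numbers.

Proof. Let us put $\mathcal{S} := \mathbb{R}_+ \times \mathbb{R}^l \times (\mathbb{R}^d)^l$, $\mathcal{T} := \mathcal{X}^n \times \mathcal{S}$, and

$$f_i(x_1, x_2, \ldots, x_n, \epsilon, \gamma_1, \gamma_2, \ldots, \gamma_l, U_1, U_2, \ldots, U_l) := \hat{\mu}_i \quad (i = 1, 2, \ldots, n)$$

for $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ and $(\epsilon, \gamma_1, \gamma_2, \ldots, \gamma_l, U_1, U_2, \ldots, U_l) \in \mathcal{S}$. We can verify that $G_X + n \epsilon I_n = (k(x_i, x_j))_{1 \le i, j \le n} + n \epsilon I_n$ is non-singular almost everywhere on $\mathcal{T}$ in the same way as in the proof of Proposition 6.2. Let us define a closed measure-zero set

$$\mathcal{V} := \{ (x_1, x_2, \ldots, x_n, \epsilon, \gamma_1, \gamma_2, \ldots, \gamma_l, U_1, U_2, \ldots, U_l) \in \mathcal{T} \mid \det(G_X + n \epsilon I_n) = 0 \} \subset \mathcal{T}.$$

Then $f_i$ is defined on $\mathcal{T} \setminus \mathcal{V}$ for each $i \in \{1, 2, \ldots, n\}$. Using Cramer’s rule,

$$\hat{\mu}_i = \frac{\det(\eta_1, \eta_2, \ldots, \eta_{i-1}, \hat{\mathbf{m}}_\Pi, \eta_{i+1}, \ldots, \eta_n)}{\det(G_X + n \epsilon I_n)},$$

where $\eta_m$ stands for the $m$-th column vector of $G_X + n \epsilon I_n$. Here we denote by $g_i$ the numerator of $\hat{\mu}_i$, that is, $g_i = \hat{\mu}_i \det(G_X + n \epsilon I_n)$. Let us choose $\xi_1, \xi_2, \ldots, \xi_n \in \mathcal{X}$ such that $\{k(\cdot, \xi_i)\}_{1 \le i \le n}$ are linearly independent in $\mathcal{H}$. It is easy to see that $g_i(\xi_1, \xi_2, \ldots, \xi_n, *)$ is a non-zero analytic function of $*$ on $\mathcal{S}$. Indeed, if $\epsilon \to +0$, $U_1 = \xi_i$, $\gamma_1 = 1$, and $\gamma_2 = \gamma_3 = \cdots = \gamma_l = 0$, then $g_i \to \det(\langle k(\cdot, \xi_i), k(\cdot, \xi_j) \rangle_{\mathcal{H}})_{1 \le i, j \le n} \neq 0$. Hence $Z_i := \{ * \in \mathcal{S} \mid g_i(\xi_1, \xi_2, \ldots, \xi_n, *) = 0 \}$ is a closed subset of $\mathcal{S}$ with Lebesgue measure zero for each $i \in \{1, 2, \ldots, n\}$. Thus, since $g_i(*, \epsilon, \gamma_1, \gamma_2, \ldots, \gamma_l, U_1, U_2, \ldots, U_l)$ is a non-zero analytic function of $*$ on $\mathcal{X}^n$ for any $(\epsilon, \gamma_1, \gamma_2, \ldots, \gamma_l, U_1, U_2, \ldots, U_l) \in \mathcal{S} \setminus (\cup_{i=1}^{n} Z_i)$,

$$T_i := \{ * \in \mathcal{X}^n \mid g_i(*, \epsilon, \gamma_1, \gamma_2, \ldots, \gamma_l, U_1, U_2, \ldots, U_l) = 0 \}$$

is a closed subset of $\mathcal{X}^n$ with Lebesgue measure zero for each $i \in \{1, 2, \ldots, n\}$. Therefore $\hat{\mu}_i = f_i(*, \epsilon, \gamma_1, \gamma_2, \ldots, \gamma_l, U_1, U_2, \ldots, U_l)$ is non-zero almost surely on $\mathcal{X}^n$ for any $(\epsilon, \gamma_1, \gamma_2, \ldots, \gamma_l, U_1, U_2, \ldots, U_l) \in \mathcal{S} \setminus (\cup_{i=1}^{n} Z_i)$, because the set of points $* \in \mathcal{X}^n$ at which $f_i(*, \epsilon, \gamma_1, \gamma_2, \ldots, \gamma_l, U_1, U_2, \ldots, U_l)$ is defined and equal to 0 is contained in $T_i$ for each $i \in \{1, 2, \ldots, n\}$. This completes the proof. □

The following corollary directly follows from Proposition 6.4.

Corollary 6.5. Let $k$ be a Gaussian kernel on $\mathbb{R}^d$ and let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with the same normal distribution on $\mathbb{R}^d$. All other notations are as in Proposition 6.4. Then $\Lambda := \mathrm{diag}(\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_n)$ is non-singular almost surely for any $(\epsilon, \gamma_1, \gamma_2, \ldots, \gamma_l, U_1, U_2, \ldots, U_l) \in \mathbb{R}_+ \times \mathbb{R}^l \times (\mathbb{R}^d)^l$ except for those in a set of Lebesgue measure zero.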
§6.3 Non-surjectivity of Cxx and Cww 


The covariance operators $C_{XX}$ and $C_{WW}$ are not surjective in general. This can be verified by the fact that they are compact operators. (If the operators are surjective on the corresponding RKHS which is infinite-dimensional, then they cannot be compact because of the open mapping theorem.) Here we present some easy examples where $C_{XX}$ and $C_{WW}$ are not surjective. Let us consider for simplicity the case $\mathcal{X} = \mathbb{R}$. Let $X$ be a random variable on $\mathbb{R}$ with a normal distribution $N(\mu, \sigma_0^2)$. We prove that $C_{XX}$ is not surjective under the usual assumption that the positive definite kernel on $\mathbb{R}$ is Gaussian. In order to demonstrate this, we use the symbols defined in §3.3 and several proven
results on function spaces and Fourier transforms (see Q, for example). Note that the 
following three propositions are introduced without proofs. 


Proposition 6.6. Let us put $f(x) = \exp(-(a x^2 + b x + c))$ for $a, b, c \in \mathbb{R}$, where $a > 0$. Then

$$\hat{f}(t) = \frac{1}{\sqrt{2a}} \exp\left( -\frac{t^2 - 2\sqrt{-1}\, b t - b^2 + 4ac}{4a} \right).$$
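Although Proposition 6.6 is stated without proof, the closed form can be checked against direct numerical integration of the Fourier transform of Definition 3.4. The sketch below uses only the Python standard library; the coefficients $a$, $b$, $c$, the integration range, and the step count are arbitrary choices.

```python
import cmath
import math

def fourier_numeric(f, t, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule approximation of (1/sqrt(2 pi)) * integral of f(x) exp(-i t x) dx."""
    dx = (hi - lo) / steps
    total = 0j
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += f(x) * cmath.exp(-1j * t * x)
    return total * dx / math.sqrt(2.0 * math.pi)

def fourier_closed_form(a, b, c, t):
    """Closed form of Proposition 6.6 for f(x) = exp(-(a x^2 + b x + c))."""
    exponent = -(t * t - 2j * b * t - b * b + 4.0 * a * c) / (4.0 * a)
    return cmath.exp(exponent) / math.sqrt(2.0 * a)

a, b, c = 1.5, 0.7, -0.2                         # hypothetical coefficients with a > 0

def f(x):
    return math.exp(-(a * x * x + b * x + c))

errors = [abs(fourier_numeric(f, t) - fourier_closed_form(a, b, c, t))
          for t in (-2.0, 0.0, 1.0, 3.0)]
```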
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Proposition 6.7. For f € L^(R, C), f{t) = f{-t) almost everywhere. In particular, if 
f € L^(]R), then f{t) - f{-t) almost everywhere. 

Proposition 6.8. For f € put fa{x) := f{x-a). Then fait) = exp(^- xT^at^fit). 


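Definition 6.9. Let $p(\cdot)$ denote the density function of the normal distribution $N(\mu, \sigma_0^2)$
on $\mathbb{R}$, that is,

$$p(\cdot) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left(-\frac{(\cdot - \mu)^2}{2\sigma_0^2}\right).$$

Let $X$ be a random variable on $\mathbb{R}$ with $N(\mu, \sigma_0^2)$. The linear operator
$C_{XX} : \mathcal{H}_G \to \mathcal{H}_G$ is defined by $\langle C_{XX} f, g \rangle_{\mathcal{H}_G} = E[f(X)g(X)]$
for any $f, g \in \mathcal{H}_G$, which is also described as

$$(C_{XX} f)(\cdot) = \int_{-\infty}^{\infty} f(x)\, k_G(\cdot, x)\, p(x)\, dx$$

for any $f \in \mathcal{H}_G$.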
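The integral form of $C_{XX}$ in Definition 6.9 can be evaluated numerically. The sketch below is
only an illustration under stated assumptions: it takes the normalized Gaussian kernel
$k_G(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp(-(x - y)^2 / (2\sigma^2))$, applies the midpoint rule, and
compares the result with a closed form obtained here by completing the square for the test
function $f(x) = \exp(-(x - b)^2 / (2\sigma^2))$. The function names, the parameter $b$, and the
numerical values are hypothetical and not taken from the paper.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Normalized Gaussian kernel k_G(x, y) with bandwidth sigma (assumed normalization)."""
    return math.exp(-((x - y) ** 2) / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)


def normal_density(x, mu, sigma0):
    """Density p(x) of N(mu, sigma0^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma0**2)) / (math.sqrt(2 * math.pi) * sigma0)


def apply_cxx(f, y, mu, sigma0, sigma, half_width=12.0, steps=20000):
    """Midpoint-rule value of (C_XX f)(y) = integral f(x) k_G(y, x) p(x) dx.

    The grid is centered at mu and covers half_width standard deviations of p on each side.
    """
    lower = mu - half_width * sigma0
    dx = 2 * half_width * sigma0 / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * dx
        total += f(x) * gaussian_kernel(y, x, sigma) * normal_density(x, mu, sigma0)
    return total * dx


def cxx_on_bump_closed_form(y, b, mu, sigma0, sigma):
    """Closed form of (C_XX f)(y) for f(x) = exp(-(x - b)^2 / (2 sigma^2)).

    Derived for this sketch by completing the square; not a formula from the paper.
    """
    s2 = sigma**2 / 2 + sigma0**2
    m = (b + y) / 2
    return (
        math.exp(-((b - y) ** 2) / (4 * sigma**2))
        * math.exp(-((m - mu) ** 2) / (2 * s2))
        / (math.sqrt(2) * math.sqrt(2 * math.pi * s2))
    )


if __name__ == "__main__":
    mu, sigma0, sigma, b = 0.5, 1.5, 1.0, -0.3
    bump = lambda x: math.exp(-((x - b) ** 2) / (2 * sigma**2))
    for y in (-1.0, 0.0, 0.8):
        print(y, apply_cxx(bump, y, mu, sigma0, sigma), cxx_on_bump_closed_form(y, b, mu, sigma0, sigma))
```

The two columns printed for each $y$ should agree closely, which gives a quick check that the
integral expression is being evaluated as intended.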

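Proposition 6.10. If $f, g \in \mathcal{H}_G$, then $\langle f, g \rangle_{\mathcal{H}_G} \in \mathbb{R}$.

Proof. From Proposition 6.7, $\hat{f}(t) = \overline{\hat{f}(-t)}$ and $\hat{g}(t) = \overline{\hat{g}(-t)}$ for any
$f, g \in \mathcal{H}_G$. Then, using the Fourier-transform expression of the inner product on
$\mathcal{H}_G$, we have

$$\begin{aligned}
\langle f, g \rangle_{\mathcal{H}_G}
&= \int_{-\infty}^{\infty} \hat{f}(t)\, \overline{\hat{g}(t)} \exp\left(\frac{\sigma^2}{2} t^2\right) dt
 = \int_{-\infty}^{\infty} \overline{\hat{f}(-t)}\, \hat{g}(-t) \exp\left(\frac{\sigma^2}{2} t^2\right) dt \\
&= \int_{-\infty}^{\infty} \overline{\hat{f}(t)}\, \hat{g}(t) \exp\left(\frac{\sigma^2}{2} t^2\right) dt
 = \overline{\int_{-\infty}^{\infty} \hat{f}(t)\, \overline{\hat{g}(t)} \exp\left(\frac{\sigma^2}{2} t^2\right) dt}
 = \overline{\langle f, g \rangle_{\mathcal{H}_G}}.
\end{aligned}$$

Therefore, $\langle f, g \rangle_{\mathcal{H}_G} \in \mathbb{R}$.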

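Proposition 6.11. If $f \in \mathcal{H}_G(\mathbb{R}, \mathbb{C})$, then $\bar{f} \in \mathcal{H}_G(\mathbb{R}, \mathbb{C})$.

Proof. From Proposition 6.7, we have $\hat{\bar{f}}(t) = \overline{\hat{f}(-t)}$ for $f \in L^2(\mathbb{R}, \mathbb{C})$.
Then, using the Fourier-transform expression of the inner product,

$$\begin{aligned}
\|\bar{f}\|^2_{\mathcal{H}_G(\mathbb{R}, \mathbb{C})}
&= \langle \bar{f}, \bar{f} \rangle_{\mathcal{H}_G(\mathbb{R}, \mathbb{C})}
 = \int_{-\infty}^{\infty} |\hat{\bar{f}}(t)|^2 \exp\left(\frac{\sigma^2}{2} t^2\right) dt
 = \int_{-\infty}^{\infty} |\hat{f}(-t)|^2 \exp\left(\frac{\sigma^2}{2} t^2\right) dt \\
&= \int_{-\infty}^{\infty} |\hat{f}(t)|^2 \exp\left(\frac{\sigma^2}{2} t^2\right) dt
 = \|f\|^2_{\mathcal{H}_G(\mathbb{R}, \mathbb{C})} < \infty.
\end{aligned}$$

Therefore, $\bar{f} \in \mathcal{H}_G(\mathbb{R}, \mathbb{C})$.

Here, we denote by Re, Im, and Cl the real part of a complex number, the imaginary
part of a complex number, and the closure operator, respectively.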


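Corollary 6.12. If $f \in \mathcal{H}_G(\mathbb{R}, \mathbb{C})$, then $\mathrm{Re}(f), \mathrm{Im}(f) \in \mathcal{H}_G$.

Proof. If $f \in \mathcal{H}_G(\mathbb{R}, \mathbb{C})$, then $\bar{f} \in \mathcal{H}_G(\mathbb{R}, \mathbb{C})$ by Proposition 6.11.
Hence we see that

$$\mathrm{Re}(f) = \frac{f + \bar{f}}{2} \in \mathcal{H}_G, \qquad \mathrm{Im}(f) = \frac{f - \bar{f}}{2\sqrt{-1}} \in \mathcal{H}_G.$$

This completes the proof.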


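Remark 6.13. If $f \in \mathcal{H}_G(\mathbb{R}, \mathbb{C})$, then there uniquely exist $f_1, f_2 \in \mathcal{H}_G$ such
that $f = f_1 + \sqrt{-1} f_2$ by Corollary 6.12. This means that
$\mathcal{H}_G(\mathbb{R}, \mathbb{C}) = \mathcal{H}_G \oplus \sqrt{-1}\,\mathcal{H}_G$, where $\oplus$ denotes the direct sum.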


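Proposition 6.14. For any $f \in L^2(\mathbb{R}, \mathbb{C})$ and for any $\varepsilon > 0$, there exists
$g \in \mathcal{H}_G(\mathbb{R}, \mathbb{C})$ such that $\|f - g\|_2 < \varepsilon$. In other words,
$\mathcal{H}_G(\mathbb{R}, \mathbb{C})$ is dense in $L^2(\mathbb{R}, \mathbb{C})$.

Proof. Let $C_0(\mathbb{R}, \mathbb{C})$ denote the space of continuous complex-valued functions with
compact support on $\mathbb{R}$. Let us define $\widehat{\mathcal{H}}_G(\mathbb{R}, \mathbb{C})$ by

$$\widehat{\mathcal{H}}_G(\mathbb{R}, \mathbb{C}) := \left\{ h \in L^2(\mathbb{R}, \mathbb{C}) \;\middle|\; \int_{-\infty}^{\infty} |h(t)|^2 \exp\left(\frac{\sigma^2}{2} t^2\right) dt < \infty \right\}.$$

Note that $\widehat{\mathcal{H}}_G(\mathbb{R}, \mathbb{C})$ coincides with the image of $\mathcal{H}_G(\mathbb{R}, \mathbb{C})$ by the
Fourier transform. Then, $C_0(\mathbb{R}, \mathbb{C}) \subset \widehat{\mathcal{H}}_G(\mathbb{R}, \mathbb{C}) \subset L^2(\mathbb{R}, \mathbb{C})$
and $\mathrm{Cl}(C_0(\mathbb{R}, \mathbb{C})) = L^2(\mathbb{R}, \mathbb{C})$. Hence
$\mathrm{Cl}(\widehat{\mathcal{H}}_G(\mathbb{R}, \mathbb{C})) = L^2(\mathbb{R}, \mathbb{C})$. In other words, for any
$f \in L^2(\mathbb{R}, \mathbb{C})$ and for any $\varepsilon > 0$, there exists
$\hat{g} \in \widehat{\mathcal{H}}_G(\mathbb{R}, \mathbb{C})$ such that $\|\hat{f} - \hat{g}\|_2 < \varepsilon$ because
$\hat{f} \in L^2(\mathbb{R}, \mathbb{C})$. Since the Fourier transform preserves the $L^2$ norm, this implies
that there exists $g \in \mathcal{H}_G(\mathbb{R}, \mathbb{C})$ such that $\|f - g\|_2 < \varepsilon$. This completes
the proof. □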

The following corollary has also been shown in Theorem 4.63 in [8].

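Corollary 6.15. $\mathrm{Cl}(\mathcal{H}_G) = L^2(\mathbb{R})$.

Proof. From Proposition 6.14, for any $f \in L^2(\mathbb{R}) \subset L^2(\mathbb{R}, \mathbb{C})$ and for any
$\varepsilon > 0$, there exists $g \in \mathcal{H}_G(\mathbb{R}, \mathbb{C})$ such that $\|f - g\|_2 < \varepsilon$. By
Remark 6.13, there exist $g_1, g_2 \in \mathcal{H}_G$ such that $g = g_1 + \sqrt{-1} g_2$. Thus,

$$\varepsilon^2 > \|f - g\|_2^2 = \int_{-\infty}^{\infty} \left| (f - g_1) - \sqrt{-1}\, g_2 \right|^2 dx
\geq \int_{-\infty}^{\infty} |f - g_1|^2\, dx = \|f - g_1\|_2^2.$$

Therefore $\|f - g_1\|_2 < \varepsilon$. This completes the proof.

Definition 6.16. Let us define $r, r_n \in L^2(\mathbb{R})$ as

$$r(t) := \frac{1}{|t|}\, 1_{(1,\infty)}(|t|), \qquad r_n(t) := \frac{1}{|t|}\, 1_{(1,n)}(|t|),$$

where $1_{(1,\infty)}$ and $1_{(1,n)}$ denote the indicator functions of the intervals $(1,\infty)$ and $(1,n)$,
respectively. We also put $h_n := \check{r}_n$ and $h := \check{r}$. Note that
$\lim_{n \to \infty} r_n = r \in L^2(\mathbb{R})$, because

$$\lim_{n \to \infty} \|r_n - r\|_2^2 = 2 \lim_{n \to \infty} \int_n^{\infty} \frac{1}{x^2}\, dx = 0.$$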
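The point of this choice of $r_n$ is that its $L^2$ norm stays bounded while its integral grows
without bound: $\|r_n\|_2^2 = 2(1 - 1/n)$ and $\int r_n(t)\,dt = 2 \log n$. The following minimal sketch
checks both closed forms with a midpoint rule. The helper names and step counts are illustrative
choices, not part of the paper.

```python
import math


def midpoint_integral(func, lower, upper, steps=100000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


def r_n_norms(n):
    """Return (squared L2 norm, L1 norm) of r_n(t) = 1/|t| on 1 < |t| < n.

    The factor 2 accounts for the two symmetric intervals (-n, -1) and (1, n).
    """
    l2_squared = 2 * midpoint_integral(lambda t: 1.0 / (t * t), 1.0, float(n))
    l1 = 2 * midpoint_integral(lambda t: 1.0 / t, 1.0, float(n))
    return l2_squared, l1


if __name__ == "__main__":
    for n in (10, 100, 1000):
        l2_squared, l1 = r_n_norms(n)
        # Closed forms: 2 * (1 - 1/n) stays below 2, while 2 * log(n) is unbounded.
        print(n, l2_squared, 2 * (1 - 1 / n), l1, 2 * math.log(n))
```

The squared $L^2$ norms approach 2, whereas the integrals keep increasing with $n$.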

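Proposition 6.17. $h_n, h \in L^2(\mathbb{R})$.

Proof. It is obvious that $h_n, h \in L^2(\mathbb{R}, \mathbb{C})$. Since $r_n \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$, we see that

$$\begin{aligned}
\overline{h_n(x)} = \overline{\check{r}_n(x)}
&= \overline{\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} r_n(t) \exp\left(\sqrt{-1}\,tx\right) dt} \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} r_n(t) \exp\left(-\sqrt{-1}\,tx\right) dt
 = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} r_n(-t') \exp\left(\sqrt{-1}\,t'x\right) dt' \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} r_n(t') \exp\left(\sqrt{-1}\,t'x\right) dt'
 = \check{r}_n(x) = h_n(x),
\end{aligned}$$

where $t' = -t$. Hence $h_n \in L^2(\mathbb{R})$. On the other hand,

$$\begin{aligned}
h(x) &= \underset{n \to \infty}{\mathrm{l.i.m.}}\; \frac{1}{\sqrt{2\pi}} \int_{-n}^{n} r(t) \exp\left(\sqrt{-1}\,tx\right) dt \\
&= \underset{n \to \infty}{\mathrm{l.i.m.}}\; \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} r_n(t) \exp\left(\sqrt{-1}\,tx\right) dt
 = \underset{n \to \infty}{\mathrm{l.i.m.}}\; h_n(x).
\end{aligned}$$

Therefore $h \in L^2(\mathbb{R})$. □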

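Let us define $k_a(\cdot) := \sqrt{2\pi}\,\sigma\, k_G(\cdot, a) = \exp\left(-\frac{(\cdot - a)^2}{2\sigma^2}\right) \in \mathcal{H}_G$
for $a \in \mathbb{R}$. Now, we prove that $k_a \notin \mathrm{Ran}(C_{XX})$ for any $a \in \mathbb{R}$. This implies
that $C_{XX}$ is not surjective.


Proposition 6.18. For any $a \in \mathbb{R}$, $k_a \in \mathcal{H}_G \setminus \mathrm{Ran}(C_{XX})$.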

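Proof. Suppose that there exists $g \in \mathcal{H}_G$ such that $C_{XX} g = k_a$. Then, for any $f \in \mathcal{H}_G$,

$$\langle k_a, f \rangle_{\mathcal{H}_G} = \langle C_{XX} g, f \rangle_{\mathcal{H}_G}. \qquad (12)$$

Let us put $k(\cdot) := \sqrt{2\pi}\,\sigma\, k_G(\cdot, 0) = \exp\left(-\frac{(\cdot)^2}{2\sigma^2}\right)$. From Proposition 6.6,
$\hat{k}(t) = \sigma \exp\left(-\frac{\sigma^2}{2} t^2\right)$. Then, using the Fourier-transform expression of the
inner product on $\mathcal{H}_G$ and Proposition 6.8, the left hand side of Equation (12) equals

$$\begin{aligned}
\int_{-\infty}^{\infty} \hat{k_a}(t)\, \overline{\hat{f}(t)} \exp\left(\frac{\sigma^2}{2} t^2\right) dt
&= \int_{-\infty}^{\infty} \exp\left(-\sqrt{-1}\,at\right) \hat{k}(t)\, \overline{\hat{f}(t)} \exp\left(\frac{\sigma^2}{2} t^2\right) dt \\
&= \sigma \int_{-\infty}^{\infty} \exp\left(-\sqrt{-1}\,at\right) \overline{\hat{f}(t)}\, dt.
\end{aligned}$$

The right hand side of Equation (12) is equal to

$$\int_{-\infty}^{\infty} g(x) f(x) p(x)\, dx = \langle gp, f \rangle_{L^2(\mathbb{R})}.$$

Thus, Equation (12) is equivalent to the following equation:

$$\langle gp, f \rangle_{L^2(\mathbb{R})} = \sigma \int_{-\infty}^{\infty} \exp\left(-\sqrt{-1}\,at\right) \overline{\hat{f}(t)}\, dt. \qquad (13)$$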


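Let us define $h_{n,a}(x) := h_n(x - a)$ and $h_a(x) := h(x - a)$. Then $h_{n,a}, h_a \in L^2(\mathbb{R})$. It is
easy to see that $\|h_{n,a} - h_a\|_2 = \|h_n - h\|_2 = \|r_n - r\|_2 \to 0$ as $n \to \infty$. Hence
$\lim_{n \to \infty} h_{n,a} = h_a$ in $L^2(\mathbb{R})$. Since $\hat{h_{n,a}}(t) = \exp(-\sqrt{-1}\,at)\,\hat{h_n}(t)$ by
Proposition 6.8, we have

$$\begin{aligned}
\int_{-\infty}^{\infty} |\hat{h_{n,a}}(t)|^2 \exp\left(\frac{\sigma^2}{2} t^2\right) dt
&= \int_{-\infty}^{\infty} |\hat{h_n}(t)|^2 \exp\left(\frac{\sigma^2}{2} t^2\right) dt
 = \int_{-\infty}^{\infty} |r_n(t)|^2 \exp\left(\frac{\sigma^2}{2} t^2\right) dt \\
&= 2 \int_1^n \frac{1}{t^2} \exp\left(\frac{\sigma^2}{2} t^2\right) dt < \infty,
\end{aligned}$$

which indicates that $h_{n,a} \in \mathcal{H}_G$. Substituting $h_{n,a}$ for $f$, Equation (13) becomes

$$\langle gp, h_{n,a} \rangle_{L^2(\mathbb{R})} = \sigma \int_{-\infty}^{\infty} \exp\left(-\sqrt{-1}\,at\right) \overline{\hat{h_{n,a}}(t)}\, dt. \qquad (14)$$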

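If $n$ goes to infinity, the left hand side of Equation (14) becomes
$\langle gp, h_a \rangle_{L^2(\mathbb{R})} \in \mathbb{R}$. On the other hand, the right hand side of Equation (14)
becomes

$$\begin{aligned}
\sigma \int_{-\infty}^{\infty} \exp\left(-\sqrt{-1}\,at\right) \overline{\exp\left(-\sqrt{-1}\,at\right) \hat{h_n}(t)}\, dt
&= \sigma \int_{-\infty}^{\infty} \overline{\hat{h_n}(t)}\, dt
 = \sigma \int_{-\infty}^{\infty} r_n(t)\, dt \\
&= 2\sigma \int_1^n \frac{1}{t}\, dt \to \infty \quad (n \to \infty).
\end{aligned}$$

This is a contradiction. Therefore, there exists no $g \in \mathcal{H}_G$ such that $C_{XX} g = k_a$. This
completes the proof. □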
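The reduction of the left hand side of Equation (12) to $\sigma \int \exp(-\sqrt{-1}\,at)\,\overline{\hat{f}(t)}\,dt$
can be checked numerically for a concrete test function. The sketch below is only an
illustration: it takes $f = k_b$ for a hypothetical $b \in \mathbb{R}$, uses the transform
$\hat{k_b}(t) = \sigma \exp(-\sigma^2 t^2 / 2) \exp(-\sqrt{-1}\,bt)$ obtained from Propositions 6.6 and 6.8,
and compares the integral with $\sqrt{2\pi}\,\sigma f(a)$, the value that the reproducing property of
$k_G$ gives for $\langle k_a, f \rangle_{\mathcal{H}_G}$. All function names and numerical values are assumptions
made for this example.

```python
import cmath
import math


def k_hat_shifted(t, b, sigma):
    """Fourier transform of exp(-(x - b)^2 / (2 sigma^2)), via Propositions 6.6 and 6.8."""
    return sigma * math.exp(-0.5 * sigma * sigma * t * t) * cmath.exp(-1j * b * t)


def lhs_of_eq12(a, b, sigma, half_width=20.0, steps=40000):
    """Midpoint-rule value of sigma * integral exp(-i a t) * conj(f_hat(t)) dt for f = k_b."""
    dt = 2 * half_width / steps
    total = 0j
    for i in range(steps):
        t = -half_width + (i + 0.5) * dt
        total += cmath.exp(-1j * a * t) * k_hat_shifted(t, b, sigma).conjugate()
    return sigma * total * dt


def reproducing_value(a, b, sigma):
    """sqrt(2 pi) * sigma * f(a) for f = k_b, as given by the reproducing property."""
    return math.sqrt(2 * math.pi) * sigma * math.exp(-((a - b) ** 2) / (2 * sigma * sigma))


if __name__ == "__main__":
    a, b, sigma = 0.4, -0.9, 1.0  # illustrative values
    print(lhs_of_eq12(a, b, sigma), reproducing_value(a, b, sigma))
```

For $f = h_{n,a}$ the same integral equals $2\sigma \log n$, which is what drives the contradiction in
the proof above.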


§7 Disclosures 

The second author was partially supported by JSPS KAKENHI Grant Numbers 23540044, 
15K04814. The authors declare no other conflicts of interest. 


References 

[1] Le Song, Jonathan Huang, Alex Smola, Kenji Fukumizu. Hilbert space embeddings
of conditional distributions with applications to dynamical systems. In Proceedings
of the 26th Annual International Conference on Machine Learning, pages 961-968,
2009.


[2] Kenji Fukumizu, Le Song, Arthur Gretton. Kernel Bayes' rule: Bayesian inference
with positive definite kernels. Journal of Machine Learning Research, 14:3753-3783,
2013.

[3] Le Song, Kenji Fukumizu, Arthur Gretton. Kernel embeddings of conditional
distributions. IEEE Signal Processing Magazine, 30:98-111, 2014.

[4] Motonobu Kanagawa, Yu Nishiyama, Arthur Gretton, Kenji Fukumizu. Monte
Carlo filtering using kernel embedding of distributions. In Proceedings of the
Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 1897-1903, 2014.

[5] Kenji Fukumizu. Introduction to kernel methods (in Japanese). Asakura Shoten,
Tokyo, 2010.

[6] Roger A. Horn, Charles R. Johnson. Matrix analysis, second edition. Cambridge
University Press, Cambridge, 2013.

[7] Walter Rudin. Real and complex analysis, third edition. McGraw-Hill Book Co.,
New York, 1987.

[8] Ingo Steinwart, Andreas Christmann. Support vector machines. Information
Science and Statistics. Springer, New York, 2008.

